========================================================
This analysis explores contributions to the 2016 United States Presidential Campaign from people residing in New York State. The source of the dataset is the Federal Election Commission. Information on each contribution includes, among other values, contributor name, city and postcode of residence, dollar amount contributed, candidate/campaign contributed to, and date of contribution.
The structure of the dataset is shown below. Notice the factor columns “cand_nm”, “cand_id” and “cityt”, and the Date column “contb_receipt_dt”. Since these columns take a specific and much smaller set of possible values, they are, along with “contb_receipt_amt”, a good starting point for EDA.
## 'data.frame': 649437 obs. of 19 variables:
## $ cmte_id : chr "C00575795" "C00575795" "C00577130" "C00577130" ...
## $ cand_id : Factor w/ 25 levels "P00003392","P20002671",..: 1 1 12 12 1 12 12 1 1 23 ...
## $ cand_nm : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 20 20 4 20 20 4 4 23 ...
## $ contbr_nm : chr "JONES TAKATA, LOUISE" "CODY, ERIN" "KEITH, SUSAN H" "LEPAGE, WILLIAM" ...
## $ contbr_city : chr "NEW YORK" "BUFFALO" "NEW YORK" "BROOKLYN" ...
## $ contbr_st : chr "NY" "NY" "NY" "NY" ...
## $ contbr_zip : Factor w/ 69031 levels "","`1136","00000",..: 7189 64481 5944 40224 58462 30808 49151 5407 62866 69031 ...
## $ contbr_employer : chr "N/A" "RUPP BAASE PFALZGRAF CUNNINGHAM LLC" "NOT EMPLOYED" "NEW YORK UNIVERSITY" ...
## $ contbr_occupation: chr "RETIRED" "ATTORNEY" "NOT EMPLOYED" "UNDERGRADUATE ADMINISTRATOR" ...
## $ contb_receipt_amt: num 100 67 50 15 100 ...
## $ contb_receipt_dt : Date, format: "2016-04-15" "2016-04-24" ...
## $ receipt_desc : chr "" "" "" "" ...
## $ memo_cd : chr "X" "X" "" "" ...
## $ memo_text : chr "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* EARMARKED CONTRIBUTION: SEE BELOW" "* EARMARKED CONTRIBUTION: SEE BELOW" ...
## $ form_tp : chr "SA18" "SA18" "SA17A" "SA17A" ...
## $ file_num : int 1091718 1091718 1077404 1077404 1091718 1077404 1077404 1091718 1091718 1146165 ...
## $ tran_id : chr "C4732422" "C4752463" "VPF7BKZ1KR1" "VPF7BKWHRY0" ...
## $ election_tp : chr "P2016" "P2016" "P2016" "P2016" ...
## $ cityt : Factor w/ 2327 levels ""," BROOKLYN",..: 1424 287 1424 269 1620 1406 1644 1424 595 1424 ...
A box plot and frequency polygon of contribution amount in dollars display a large number of outliers on both sides of zero. About 1.3% of the rows in the dataset have negative values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10100.0 15.0 27.0 140.1 100.0 11820.0
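As a rough illustration of how figures like these can be computed, here is a minimal base R sketch. The `amt` vector holds made-up values, not the FEC data, and stands in for the “contb_receipt_amt” column.

```r
# Toy contribution amounts in USD (hypothetical values, not the FEC data)
amt <- c(15, 27, 100, 25, 50, -100, 2700, 10, 250, 35)

# Quartiles, median and mean, in the same layout as the output above
summary(amt)

# Share of rows with a negative amount
neg_share <- mean(amt < 0)
neg_share

# A single large outlier pulls the mean far above the median
c(mean = mean(amt), median = median(amt))
```

Taking `mean()` of a logical vector gives the proportion of `TRUE` values, which is a compact way to get the share of negative rows.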
Zooming in on the box plot, we can see that the middle two quartiles of the contribution receipt amount run from 15 to 100 USD.
Zooming in on the frequency polygon shows that the overwhelming majority of donations are relatively small, clustered at the low end of the x axis. Also noticeable are the larger typical increments in which contributions are made: $250, $500, and $1,000.
However, the most popular amounts to donate are $25, $50, and $100. The second graph above shows that the vast majority of contributions recur at a small set of fixed amounts.
Above I split the date of the contributions into months and graphed the result from the beginning of 2015 on. At first it appears (counter-intuitively) that contributions dipped towards the end of the campaign. However, we should remember that the election was held on November 8th, 2016. Because the November data spans only about a week, the number of contributions per day in November is higher than in any other month.
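The adjustment for a partial month can be sketched as follows. The dates are invented, and the 7-day cut-off for November is taken from the text above rather than checked against the data.

```r
# Hypothetical receipt dates; in the report these come from contb_receipt_dt
dates <- as.Date(c("2016-09-03", "2016-09-17", "2016-10-05", "2016-10-12",
                   "2016-10-28", "2016-11-01", "2016-11-04", "2016-11-06"))

# Count contributions per calendar month
month <- format(dates, "%Y-%m")
n_per_month <- table(month)

# Days of each month covered by the data: full months, except November,
# which is assumed here to cover only 7 days
days_covered <- c("2016-09" = 30, "2016-10" = 31, "2016-11" = 7)

# Contributions per day make the partial month comparable to full months
per_day <- as.numeric(n_per_month) / days_covered[names(n_per_month)]
round(per_day, 3)
```

In this toy data November has no more contributions than October in absolute terms, but its daily rate is several times higher.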
We can see here that most contributions were made to just three or four candidates. Further, the majority of these contributions went to the Democratic presidential candidates, Clinton and Sanders.
##
## NEW YORK BROOKLYN BRONX ROCHESTER STATEN ISLAND
## 206970 86953 14102 9985 7431
## BUFFALO ITHACA ASTORIA ALBANY SYRACUSE
## 6196 5840 5235 5225 3929
Once classed, New York City boroughs account for 4 of the top 5, and 5 of the top 6, cities when sorted by the number of contributions. The NYC borough with the fewest contributions, Staten Island, is 6th, after Rochester (Rochester is included in the “Other” category in the graph above).
New York City, including all boroughs, accounts for about 53% of the total number of contributions. Manhattan, which has the most contributions, accounts for 32% by itself.
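One way such a city classification could be built is with a named lookup vector. This is only a sketch: the `borough_of` mapping is an assumption made for illustration (it maps Astoria to Queens and is far from complete), and the city values are invented.

```r
# Hypothetical city values; the real column has 2,327 levels
city <- c("NEW YORK", "BROOKLYN", "BRONX", "ASTORIA", "STATEN ISLAND",
          "ROCHESTER", "BUFFALO", "NEW YORK", "BROOKLYN", "ITHACA")

# Assumed lookup from city name to borough (illustrative, incomplete)
borough_of <- c("NEW YORK" = "Manhattan", "BROOKLYN" = "Brooklyn",
                "BRONX" = "Bronx", "ASTORIA" = "Queens",
                "STATEN ISLAND" = "Staten Island")

# Names missing from the lookup come back as NA and are labelled "Other"
city_class <- unname(borough_of[city])
city_class[is.na(city_class)] <- "Other"

# Share of contributions by class, and for NYC as a whole
round(prop.table(table(city_class)), 2)
nyc_share <- mean(city_class != "Other")
nyc_share
```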
The graph above is a histogram showing recurrence of contributor names throughout the data. Although some of these recurrences will be from different people with the same name, the observations are largely from repeated contributions by the same person.
We will need to take these recurring contributions into consideration when calculating average contribution amounts. It now seems to make sense to group contributions by contributor and sum these amounts to get total contributions per individual.
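Grouping by contributor and summing can be sketched in base R with `aggregate()`. The names and amounts below are made up, and `contributor_sum` is used here simply as a convenient column name.

```r
# Hypothetical contributions; the column names follow the dataset
contrib <- data.frame(
  contbr_nm = c("DOE, JANE", "DOE, JANE", "DOE, JANE", "ROE, RICHARD", "POE, ANN"),
  contb_receipt_amt = c(25, 25, 50, 500, -100),
  stringsAsFactors = FALSE
)

# One row per contributor name, holding the summed amount
contributor_sum <- aggregate(contb_receipt_amt ~ contbr_nm, data = contrib, FUN = sum)
names(contributor_sum)[2] <- "contributor_sum"
contributor_sum

# Summary statistics are now per person rather than per contribution
summary(contributor_sum$contributor_sum)
```

Three $25 to $50 contributions from one person become a single $100 total, so repeat donors no longer drag the typical amount down.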
The original data set consisted of 649,460 observations/contributions across 18 variables/columns. After import, I removed 23 (invalid) contributions/rows and added 4 columns to help classify and summarize the data. There are 25 levels of candidates and 2,327 contributor cities, and the vast majority of the contributions are dated between April 2016 and November 2016. There are a few other distinguishing structural characteristics:
Dominant categories: Contributions to Hillary Clinton comprise over 61% of the total number of contributions. Contributions to either Clinton or Bernie Sanders comprise over 88% of the data. Also, 53% of contributions came from New York City boroughs.
Outliers: The average contribution is $140.10, yet the median is only $27, and even the 3rd quartile is just $100. This large difference between mean and median is due to the pull of outliers well outside the normal range of the data.
Negative values: Negative contributions represent 1.3% of the total data. These values are refunds. There are, however, many negative outliers in the dataset, which further distort the measures of central tendency mentioned above.
Recurring values: There are many recurring names in the data, representing people making more than one donation. These repeated donations from the same individual may need to be grouped together to get a clear idea of the average contribution. Finally, certain round amounts are very popular contribution sizes.
I am mainly interested in how contribution data varies across candidates. Candidate name is the main feature of the dataset and should be treated as the independent variable. I hope to extract some trends within contributions for each candidate. I expect there is a certain “type” of contribution/contributor more common to certain candidates.
Contribution amounts, number of unique contributors, contribution averages per person, and contributions per location (city) are likely to vary across candidates. The challenge will be to extract some correlations between these variables and the candidates.
I created summary columns that collapse the many levels into more manageable groupings. “Contb_receipt_month” groups the date of contribution into months, “candidate_class” sorts the contributions by the most popular candidates, and “city_class” distinguishes the boroughs of New York from other cities.
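A minimal sketch of how two of these summary columns might be derived is shown below. The candidate name strings and the rule for extracting a surname are assumptions made for illustration. The four class levels match the levels printed later in this report.

```r
# Hypothetical rows; the name strings are assumed, not copied from the data
df <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Sanders, Bernard", "Trump, Donald J.",
              "Bush, Jeb", "Clinton, Hillary Rodham"),
  contb_receipt_dt = as.Date(c("2016-04-15", "2016-04-24", "2016-06-02",
                               "2015-09-30", "2016-10-11")),
  stringsAsFactors = FALSE
)

# Month of receipt, stored as the first day of that month
df$contb_receipt_month <- as.Date(format(df$contb_receipt_dt, "%Y-%m-01"))

# Keep the surname for the three largest campaigns, otherwise "Other"
surname <- sub(",.*$", "", df$cand_nm)
df$candidate_class <- factor(
  ifelse(surname %in% c("Clinton", "Sanders", "Trump"), surname, "Other"),
  levels = c("Other", "Clinton", "Sanders", "Trump")
)

levels(df$candidate_class)
table(df$candidate_class)
```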
Data is very unevenly distributed in dollar amounts contributed, number of contributions per city, and number of contributions per contributor.
I converted variables to factors and coerced dates to the Date type where appropriate. I also deleted batched contributions from the data, as I determined these contributions likely originate from outside New York State.
Measured by total dollar receipts, contributions to Clinton surpass those of other candidates by an even wider margin than when measured by the number of contributions alone. Whereas the Clinton campaign accounted for 61% of the total number of donations, it accounts for almost 71% of total contributions in dollars. This must be because the average contribution to Clinton is larger than the average contribution to most other candidates.
Likewise, whereas New York City (in this case, including only Manhattan) represents 32% of the total number of contributions, it accounts for over 50% of the total money contributed. Contributions from Manhattan, like contributions to Hillary Clinton, must be higher than average.
This leads to the question: how many of these larger contributions from Manhattan are also contributions to the Hillary Clinton campaign? I suspect a large majority.
The NYC boroughs together account for over 65% of the total money contributed, versus 53% of the total number of contributions.
How popular were the different candidates across the boroughs? Were some candidates more popular outside of NYC altogether?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11700.0 95.0 250.0 706.7 750.0 30280.0
The histogram above shows contributor sum on the x axis. Contributor sum is a variable created from two other variables: contribution amount (“contb_receipt_amt” column) and contributor name (“contbr_nm” column). It shows total contributions per person.
Once a contributor’s contributions are summed into an individual contributor sum, the average amount increases from $140.10 to $706.70, and the median increases from $27 to $250. The previously observed large difference between the mean and median of single contributions carries over into contributor sums. This difference once again reflects the positive skew of the distribution.
A box plot showing the total contribution amount per person closely resembles the box plot for individual contributions in the Univariate section above. We again see a large number of outliers on either side of zero. The lone point at approximately 30,000 represents George Pataki’s contributions to his own campaign.
When individual contribution sum is plotted against variables such as city or candidate, however, there are some large differences in mean contribution and quantiles. There certainly seem to be different “types” of contributor across candidate and location.
The graph above shows the proportion of small contributors (total contributions of less than or equal to $200) to total contributors for the five most popular candidates. We can see the Trump campaign had the largest percentage of small contributors.
Since we have seen some evidence so far that people living in NYC boroughs tend to contribute more, this graph also seems to suggest that more Trump contributors may live outside the NYC boroughs.
Note: the “other” category denotes contributors who received refunds on their contributions.
However, when we observe contributor type by total dollars contributed, a very different graph emerges. Now it’s clear that small contributions still make up a very small fraction of total dollars across all candidates.
This largely explains why the Clinton campaign raised so much more money than other candidates - a much higher percentage of her contributors were large donors. Incidentally, it also suggests that relying on a small contributor base is an ineffective fundraising strategy.
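The contrast between the share of donors and the share of dollars can be reproduced on a toy example. The candidates “A” and “B” and all amounts are hypothetical, and refunds (negative totals) are left out for simplicity.

```r
# Hypothetical per-person totals for two made-up candidates
totals <- data.frame(
  candidate = c("A", "A", "A", "A", "B", "B", "B", "B"),
  contributor_sum = c(50, 120, 200, 2700, 2700, 1000, 150, 500),
  stringsAsFactors = FALSE
)

# A small contributor gave $200 or less in total
totals$small <- totals$contributor_sum <= 200

# Share of each candidate's contributors who are small contributors
share_of_donors <- tapply(totals$small, totals$candidate, mean)

# Share of each candidate's dollars that come from small contributors
small_dollars <- tapply(totals$contributor_sum * totals$small, totals$candidate, sum)
all_dollars <- tapply(totals$contributor_sum, totals$candidate, sum)
share_of_dollars <- small_dollars / all_dollars

round(rbind(share_of_donors, share_of_dollars), 3)
```

For candidate “A”, three of four contributors are small, yet they supply only about 12% of the dollars. The graphs above show the same pattern at much larger scale.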
This graph formalizes the conclusions from the previous two. While the Clinton campaign did not have the highest average contribution per person, it received much more on average than its main political rivals, Trump and Sanders.
It is interesting, however, that nationally Trump and Sanders, the candidates with the smallest average contributions per person, far exceeded early expectations of their chances of winning the election. Perhaps a small average contribution, while not an effective fundraising strategy here, is also indicative of a broader, even more enthusiastic support base.
The two graphs above show that within the five NYC boroughs, Manhattan (here New York) is in a different class of contribution sizes. Interestingly, cities like Buffalo and Rochester look much more similar to the other boroughs than New York does. In terms of contributor totals, New York is more comparable to cities like Larchmont and Scarsdale than it is to the other boroughs.
Are there further correlations within a city between average total individual contribution and the popularity of certain candidates? Or are the larger average contributions we have seen for certain candidates more or less consistent regardless of where the contributors live?
Besides donation size, employer and work-related data might be useful in identifying contributor “types.” For now, we can note the large number of contributors who are not associated with any specific company: the top two categories are “self-employed” and “retired.”
Graph showing dollars contributed over time.
The bivariate graphs above begin to explore the relationship between dollars contributed, location (city), and candidate. The average contribution seems to vary across city and political viewpoint.
It is interesting to see that while the most money was contributed in the final few months of the campaign, there is a sudden drop in total contributions in May of 2016. This was also the month that Donald Trump secured the Republican presidential nomination.
Also, the large number of contributors that were unaffiliated with any specific company was surprising to me.
So far we have seen that Hillary Clinton supporters from New York City are likely to contribute much more on average than contributors elsewhere.
Also, a contributor to Donald Trump is far more likely to contribute $200 or less than a contributor to Hillary Clinton.
Here we can see location by total contribution by political party. We can finally confirm that while Democratic candidates raised far more money than Republicans both inside and outside of New York City, Republican candidates were relatively more successful outside of NYC than they were in NYC.
The one exception to this trend is Staten Island, the one borough in NYC that seems more or less split between Republican and Democratic total contributions.
In these graphs, we see the correlation between the median total individual contribution and support for individual candidates across the top ten cities in total dollars contributed. Support is measured in unique number of contributors for a candidate versus total unique contributors in a select city.
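The two quantities being compared can be sketched as follows. Cities “X” and “Y” and all values are invented, and each row is assumed to be one unique contributor who gave to a single candidate.

```r
# One row per unique contributor (hypothetical data)
d <- data.frame(
  city = c("X", "X", "X", "X", "Y", "Y", "Y"),
  candidate = c("Clinton", "Clinton", "Clinton", "Trump",
                "Trump", "Trump", "Clinton"),
  contributor_sum = c(500, 1000, 250, 100, 50, 100, 200),
  stringsAsFactors = FALSE
)

# Support: a candidate's unique contributors as a share of all unique
# contributors in the same city (rows sum to 1)
counts <- table(d$city, d$candidate)
support <- prop.table(counts, margin = 1)
round(support, 2)

# Median total individual contribution per city
median_sum <- tapply(d$contributor_sum, d$city, median)
median_sum
```

In this toy data the city with the higher Clinton share also has the higher median, which is the direction of the relationship described in the text.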
There is a strong correlation between cities that strongly support Clinton and high average individual contributions. The opposite is true for the Trump campaign: he is more popular in places with a low average contribution per person.
Bernie Sanders was most popular in Brooklyn, Buffalo and Rochester, three cities with relatively low median contributions per person.
Do these correlations simply confirm the correlation between candidate and average individual contribution? I.e. where there are more Clinton supporters, there will also be higher average contributions?
Does introducing a variable for city add anything more to the analysis?
This graph expands on the last and helps address the questions it raised. Interestingly, it seems that where Clinton is the most popular, people donate more money not just generally, but specifically to her campaign. Where Clinton is least popular (Staten Island and Rochester), she receives far less money on average from those who support her.
However, for Trump, the places where he is most popular (Staten Island and Rochester) do not correspond to higher average individual contributions. He received the most money on average from Chappaqua, a place where he is deeply unpopular!
Let’s combine the results from the above two graphs into a new graph below to better visualize these observations.
In the two graphs above, I have replaced overall contributions with numbers for contributions to a specific campaign. The graphs map the relation between a candidate’s popularity and the median contribution to their campaign.
We can see here quite clearly that this relation appears to be reversed for Trump and Clinton’s respective campaigns.
Specifically, when we look at the top 10 cities, Trump appears to be more popular in the places where individuals tend to donate less money to his campaign. The opposite is true for Clinton: the cities that contribute more per person to her campaign also greatly support her in preference to any other candidate.
Let’s extend the analysis to all cities to see how well this relationship holds.
The graph above depicts every city in the data with a contributor to either the Trump or the Clinton campaign. Cities with one or more unique contributors to each campaign are plotted twice, once for each candidate. Since the two candidates cannot both receive more than 50% of individual contributions, each city can appear at most once on each side of the midpoint of the x axis (at 50%).
Although the graph becomes harder to read with so many data points, we can clearly observe a blue cluster on the left side of the graph, below the $390 median Clinton contribution. Past the 50% mark, the blue points spread out, with far fewer cities contributing at medians below 390 USD.
Furthermore, although there are some exceptions, contributions to Trump remain fairly consistent even in places where he is more popular.
In fact, if we take Trump contributions alone, we can see there is a wider spread of median contributions on the left side of the graph, where he is less popular.
Lastly, here is the same graph zoomed in across the 50% mark. Notice the cluster in the bottom left of the graph. These points represent cities where Trump is popular but the median contribution is below average.
Of course there is also a long vertical line at the 100% mark where there appear to be the largest number of contributions.
Here is a histogram counting each city twice: once for the percentage of its total contributors who gave to the Trump campaign and once for the percentage who gave to the Clinton campaign. Unlike the last graph, cities without any contributors to Trump or to Clinton are plotted at zero on the x axis.
Evident again is the large number of cities contributing exclusively to the Trump campaign. The only place on the graph with a larger count than cities at 100 percent Trump is cities at zero percent Clinton.
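The counts at the two ends of the histogram can be illustrated with a small matrix of contributor counts. The five cities and their counts are hypothetical, and `other` stands for contributors to any other campaign.

```r
# Rows are cities, columns are unique contributors per campaign (hypothetical)
n <- matrix(c(10, 2,
               0, 3,
               0, 1,
               4, 0,
               0, 2),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("city", 1:5), c("Clinton", "Trump")))

# Unique contributors to all other campaigns in each city
other <- c(3, 0, 0, 1, 0)

# Each campaign's share of a city's total unique contributors
share <- n / (rowSums(n) + other)
round(share, 2)

# Number of cities at 100% and at 0% for each campaign
colSums(share == 1)
colSums(share == 0)
```

Here three of the five cities are at 100% Trump and, equivalently, 0% Clinton, even though Clinton has more contributors overall.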
Although this graph abstracts from contribution receipts, it does indeed show one measure by which the Trump campaign could be said to outperform Clinton’s.
Returning to the top 15 employers by contribution totals, we can see that the Clinton campaign was heavily favored among employees at most of the investment banks and law firms. In proportion to his total contributions, Jeb Bush also has a fairly strong showing here.
Trump, on the other hand, while barely noticeable among the top companies, does better in unaffiliated categories such as homemakers and the self-employed.
## [1] "Other" "Clinton" "Sanders" "Trump"
The graph above shows occupation rather than employer and measures popularity in number of unique contributors, rather than dollars contributed. Each occupation shown has at least 15 unique contributors.
We can see a fairly strong correlation between workplace data and candidate.
Specifically, Trump performs well among those in law enforcement/corrections professions and among those in manual labor/blue collar professions. Sanders is very popular among those in the hospitality industry (bartenders, servers) and some creative professions. Once these creative professions cross into the more corporate workplace, however, Clinton emerges as the clear favorite. Her campaign was heavily backed by those in higher-paying white collar jobs such as public relations, law, and finance. This last fact also explains why she is so dominant in the graph above depicting top companies by employer contributions.
Changing topics, above is a final graph showing total contributions by month by party.
We saw previously that median contribution relates to the candidate contributed to. In the last section, we saw further that the median contribution per city specifically to the Trump and Clinton campaigns depends upon that candidate’s general popularity in a city.
The most interesting multivariate relationship is between median individual contribution and a candidate’s popularity by city.
The relationship discussed above is more complex than the simpler one between median contribution and candidate. Certainly, since Clinton supporters tend to contribute more, where there are more Clinton supporters there will also tend to be a higher median contribution overall. The important point is that in those places the median contribution specifically to the Clinton campaign also tends to be higher.
The opposite is true for Trump. Where he is most popular, people tend to contribute less to his campaign than people do in places where he is unpopular.
Finally, some of the best evidence of “contributor types” emerges from graphing workplace data next to candidate supported. There is some good evidence that people in certain types of professions are more likely to support certain candidates than others.
The graphs above depict small donors ($200 or less) as a percent of total donors to each campaign, and small donor contributions as a percent of the total dollars contributed.
During the election, there was a lot of discussion about how each candidate’s campaign was funded, and especially about which candidates relied more on small donors as a sign of grassroots strength.
When we graph contributions from New York State, however, it is evident that small contributors were insignificant overall, and not close to being a primary source of funding for any candidate.
Although the majority of contributors to Trump and Sanders are small donors, once converted into dollars, these contributions are dwarfed by those of the minority group of larger contributors. In fact, one of the reasons the Clinton campaign was so much better funded than these campaigns is that it relied on small contributors to a much lesser extent.
There are two tentative conclusions that can be drawn/observed in the graph above.
The first is how geographically divisive the election was across New York State. The largest categories are those where a candidate has either 100% or 0% of the contributions. Although contributors to Trump and Clinton together made up more than 73% of total contributors, in a majority of cities each of these top candidates received either 100% or 0% of total contributions.
The second conclusion is that although Clinton out-raised Trump in total dollars by every measure, there are vast swaths of New York State where she has no contributors and where Trump is very popular. Although Trump had only 38% as many contributors as Clinton, far more cities contributed at a rate of 100% to Trump than to Clinton, and even more cities contributed nothing to Clinton than contributed nothing to Trump.
The graph above depicts most clearly the central multivariate relationship of this analysis.
The relationship between Median Contribution and Popularity moves in different directions for the Clinton and Trump campaign. See the reflection section below for some further insight about what could explain this observation.
The largest struggle in this analysis was trying to find relationships between variables when the dataset is largely defined by a few very dominant categories/values. Specifically, New York City and Hillary Clinton were such a large source of contribution dollars that any analysis focusing on total dollars alone would be largely determined by these values. It is primarily in response to this challenge that I switched my analysis to looking at average contribution sizes.
This presented another difficulty: trying to find a unique identifier by which to group the data. Although each contribution is given its own identification number, it seemed invalid to weigh a single 5 dollar contribution from one person equally with each of several 5 dollar contributions from someone else who contributed many times. There was no UID for a contributor in the dataset, so I had to devise my own (admittedly imperfect) solution based on name and zip code.
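A name-plus-zip identifier of this kind might look like the sketch below. This is not necessarily the exact rule used in the analysis: truncating the zip code to its first five digits is an assumption, as are the column name `contributor_id` and all of the rows.

```r
# Hypothetical contributions from three people, two of whom share a name
contrib <- data.frame(
  contbr_nm = c("SMITH, JOHN", "SMITH, JOHN", "SMITH, JOHN", "LEE, ANNA"),
  contbr_zip = c("100011234", "10001", "14850", "11201"),
  contb_receipt_amt = c(5, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Assumption: zip codes may appear in 5- or 9-digit form, so keep 5 digits
zip5 <- substr(contrib$contbr_zip, 1, 5)

# Surrogate contributor ID built from name and 5-digit zip
contrib$contributor_id <- paste(contrib$contbr_nm, zip5, sep = " | ")

# Total per surrogate ID
per_person <- aggregate(contb_receipt_amt ~ contributor_id, data = contrib, FUN = sum)
per_person
```

The two John Smiths in different zip codes stay separate, while the 5- and 9-digit forms of the same zip code are merged. The key is still imperfect: two people with the same name in one zip code would be combined, and a contributor who moves would be split.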
The results of my analysis also suggest further datasets that could be used to support and deepen the conclusions drawn here:
An individual’s total contribution may be an indication of that individual’s wealth and income. In this case, a campaign with a lower median individual contribution may signify a candidate’s broader support among the middle or working class. In graph three above, we see these contribution levels grouped by city and varying with the popularity of a candidate. If Trump, for example, receives lower median contributions where he is most popular, can we conclude that Trump is more popular in cities with lower average income? Likewise, if Clinton is more popular where she gets the most money, can this be explained precisely because Clinton is popular in wealthier communities? Introducing data on the median income for each city already included in the dataset, and plotting these median incomes against a candidate’s popularity, may reveal deeper and more insightful relations than those found here.
Another area unexplored in the analysis above is a city’s population size. In graph two, we saw how popular Trump is in many NYS cities despite receiving contributions from only 20% of the people who donated. The only way his campaign can be popular in so many cities with so few total contributors is if he was popular in cities with fewer total contributors and thus, presumably, smaller populations. My analysis appears to support, and could be further supplemented by, the urban/suburban/rural effect on political support.